Traditionally, image classification has required a very large corpus of training data (often millions of images, which may not be available) and long, expensive training runs. That has changed with transfer learning, which can be used readily with Cloud ML Engine through the ML toolbox in Datalab, without deep knowledge of image classification algorithms.
This notebook codifies the capabilities discussed in this blog post. In a nutshell, it uses the pre-trained Inception model as a starting point and then applies transfer learning to train it further on additional, customer-specific images. For illustration, simple flower images are used. Compared to training from scratch, the training data requirements, time, and cost are drastically reduced.
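To make the idea concrete, here is a minimal conceptual sketch of transfer learning, written with the tf.keras API purely for illustration. It is not the mltoolbox implementation (which precomputes Inception bottleneck features and trains a small classifier on them), but it shows the same principle: freeze a pre-trained feature extractor and train only a new classification head on the customer-specific labels.
In [ ]:
# Conceptual sketch only: not the mltoolbox internals.
import tensorflow as tf

base = tf.keras.applications.InceptionV3(
    include_top=False, pooling='avg', weights='imagenet')
base.trainable = False  # keep the pre-trained Inception weights fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation='softmax'),  # e.g. 5 flower classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])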
This notebook does all operations in the Datalab container without calling the Cloud ML API; hence these are called "local" operations, although Datalab itself most often runs on a GCE VM. See the corresponding cloud notebook for the cloud experience, which only adds the --cloud parameter and some configuration to the commands shown here. The purpose of local work is initial prototyping and debugging on small-scale data, often by taking a suitable sample (say 0.1 - 1%) of the full data. The same basic steps can then be repeated with much larger datasets in the cloud.
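For example, such a sample file could be produced with a small helper like the following (a hypothetical sketch; it assumes the two-column image_url,label CSV layout used throughout this notebook):
In [ ]:
import pandas as pd

def sample_csv(input_csv, output_csv, fraction=0.01, seed=42):
  """Write a random fraction of the rows of input_csv to output_csv."""
  df = pd.read_csv(input_csv, names=['image_url', 'label'])
  df.sample(frac=fraction, random_state=seed).to_csv(
      output_csv, header=False, index=False)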
In [5]:
!mkdir -p /content/flowerdata
In [6]:
!gsutil -m cp gs://cloud-datalab/sampledata/flower/* /content/flowerdata
Define directories for preprocessing, model, and prediction.
In [7]:
import mltoolbox.image.classification as model
from google.datalab.ml import *
worker_dir = '/content/datalab/tmp/flower'
preprocessed_dir = worker_dir + '/flowerrunlocal'
model_dir = worker_dir + '/tinyflowermodellocal'
prediction_dir = worker_dir + '/flowermodelevallocal'
images_dir = worker_dir + '/images'
local_train_file = '/content/flowerdata/train200local.csv'
local_eval_file = '/content/flowerdata/eval100local.csv'
In [8]:
!mkdir -p {images_dir}
For best efficiency, we download the images to local disk and rewrite our training and evaluation files to reference local paths instead of GCS paths. Note that the original files referencing GCS image paths work too, just a bit more slowly.
In [9]:
import csv
import os
import datalab.storage as gcs

def download_images(input_csv, output_csv, images_dir):
  # Read the (image_url, label) rows, copy each image from GCS to local disk,
  # and write a new CSV whose image_url column points at the local copies.
  with open(input_csv) as csvfile:
    data = list(csv.DictReader(csvfile, fieldnames=['image_url', 'label']))
  for x in data:
    url = x['image_url']
    out_file = os.path.join(images_dir, os.path.basename(url))
    with open(out_file, 'wb') as f:  # binary mode for raw image bytes
      f.write(gcs.Item.from_url(url).read_from())
    x['image_url'] = out_file
  with open(output_csv, 'w') as w:
    csv.DictWriter(w, fieldnames=['image_url', 'label']).writerows(data)

download_images('/content/flowerdata/train200.csv', local_train_file, images_dir)
download_images('/content/flowerdata/eval100.csv', local_eval_file, images_dir)
The effect of the above code is best illustrated by comparing the two files below.
In [10]:
!head /content/flowerdata/train200.csv -n 5
In [11]:
!head {local_train_file} -n 5
The following cell takes ~5 min on an n1-standard-1 VM. Preprocessing the full 3,000 images takes about one hour.
In [12]:
# Instead of local_train_file, this can also take '/content/flowerdata/train200.csv', but processing will be slower.
train_set = CsvDataSet(local_train_file, schema='image_url:STRING,label:STRING')
model.preprocess(train_set, preprocessed_dir)
In [13]:
import logging
# Raise the logging level so training progress is visible in the notebook,
# then restore it afterwards.
logging.getLogger().setLevel(logging.INFO)
# Positional arguments: batch_size=30, max_steps=800.
model.train(preprocessed_dir, 30, 800, model_dir)
logging.getLogger().setLevel(logging.WARNING)
Run TensorBoard to visualize the completed training. Review accuracy and loss in particular.
In [14]:
tb_id = TensorBoard.start(model_dir)
We can also inspect the TensorFlow summary events written during training.
In [15]:
summary = Summary(model_dir)
summary.list_events()
Out[15]:
In [16]:
summary.plot('accuracy')
summary.plot('loss')
In [17]:
images = [
'gs://cloud-ml-data/img/flower_photos/daisy/15207766_fc2f1d692c_n.jpg',
'gs://cloud-ml-data/img/flower_photos/tulips/6876631336_54bf150990.jpg'
]
# Set show_image to False to suppress displaying the pictures.
model.predict(model_dir, images, show_image=True)
Out[17]:
We did a quick test of the model using a few samples, but to understand how it really performs we need to evaluate it against a much larger amount of labeled data. In the initial preprocessing step, we set aside enough images for that purpose. Next, we run normal batch prediction and compare the results with the previously labeled targets.
The following batch prediction and loading of results takes ~3 minutes.
In [18]:
import google.datalab.bigquery as bq
bq.Dataset('flower').create()
Out[18]:
In [19]:
eval_set = CsvDataSet(local_eval_file, schema='image_url:STRING,label:STRING')
model.batch_predict(eval_set, model_dir, output_bq_table='flower.eval_results_local')
Now that we have the predictions and expected labels loaded into a BigQuery table, let's start analyzing the errors and plot the confusion matrix.
In [20]:
%%bq query --name wrong_prediction
SELECT * FROM flower.eval_results_local WHERE target != predicted
In [21]:
wrong_prediction.execute().result()
Out[21]:
A confusion matrix is a common way of summarizing where the model gets confused: it aggregates, per class, how often the predicted label did or did not match the expected label.
In [22]:
ConfusionMatrix.from_bigquery('flower.eval_results_local').plot()
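If you want to see the raw counts behind the plot, the same aggregation can be expressed directly as a query against the results table (a sketch equivalent to what the plot summarizes):
In [ ]:
%%bq query
SELECT target, predicted, COUNT(*) as count
FROM flower.eval_results_local
GROUP BY target, predicted
ORDER BY target, predicted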
More advanced analysis can be done with the feature slice view. For it, let's define SQL queries that compute per-class accuracy and log loss, and then plot the metrics.
In [23]:
%%bq query --name accuracy
SELECT
  target,
  SUM(CASE WHEN target=predicted THEN 1 ELSE 0 END) as correct,
  COUNT(*) as total,
  SUM(CASE WHEN target=predicted THEN 1 ELSE 0 END)/COUNT(*) as accuracy
FROM
  flower.eval_results_local
GROUP BY
  target
In [24]:
accuracy.execute().result()
Out[24]:
In [25]:
%%bq query --name logloss
SELECT feature, AVG(-logloss) as logloss, COUNT(*) as count FROM
(
  SELECT feature, CASE WHEN correct=1 THEN LOG(prob) ELSE LOG(1-prob) END as logloss
  FROM
  (
    SELECT
      target as feature,
      CASE WHEN target=predicted THEN 1 ELSE 0 END as correct,
      target_prob as prob
    FROM flower.eval_results_local))
GROUP BY feature
In [26]:
FeatureSliceView().plot(logloss)
In [27]:
import shutil
import google.datalab.bigquery as bq
# Clean up: stop TensorBoard, delete the BigQuery results table, and remove
# the local working directory.
TensorBoard.stop(tb_id)
bq.Table('flower.eval_results_local').delete()
shutil.rmtree(worker_dir)
In this notebook, we covered local preprocessing, training, prediction, and evaluation. We started from data in GCS (CSV files plus images), used transfer learning for very fast training, and then used BigQuery for model performance analysis. In the next notebook, we will use the Cloud ML APIs, which scale much better to larger datasets. The syntax and analyses remain the same.